class: center, middle, inverse, title-slide .title[ # ISA 444: Business Forecasting ] .subtitle[ ## 05: Time Series Summaries ] .author[ ###
Fadel M. Megahed, PhD
Endres Associate Professor
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Spring 2023 ] --- # Quick Refresher from Last Class ✅ Examine a line chart for trends, seasonality, and cycles. ✅ Explain the grammar of graphics and how it can be used to create time series plots in
R. ✅ Create interactive time-series plots by using the [plotly](https://plotly.com/ggplot2/getting-started/) package
in R. --- # Learning Objectives for Today's Class - Use numerical summaries to describe a time series. - Explain what we mean by correlation. - Apply transformations to a time series. --- class: inverse, center, middle # Summarizing Time-Series Data --- # Measures of Average **Mean:** Given a set of `\(n\)` values `\(Y_1, \, Y_2, \, \dots, \, Y_n\)`, the arithmetic mean can be computed as: `$$\bar{Y} = \frac{Y_1 + Y_2 + \dots + Y_n}{n} = \frac{1}{n}\sum_{i=1}^{n}Y_i.$$` <br> **Order Statistics:** Given a set of `\(n\)` values `\(Y_1, \, Y_2, \, \dots, \, Y_n\)`, we place them in ascending order to define the order statistics, written as `\(Y_{(1)}, \, Y_{(2)}, \, \dots, \, Y_{(n)}.\)` **Median:** - If `\(n\)` is odd, `\(n = 2m + 1\)` and the median is `\(Y_{(m+1)}\)`. - If `\(n\)` is even, `\(n = 2m\)` and the median is the average of the two middle numbers, i.e., `\(\frac{1}{2}[Y_{(m)} + Y_{(m+1)}]\)`. --- # Measures of Variation The **range** denotes the difference between the largest and smallest value in a sample: `$$Range = Y_{(n)} - Y_{(1)}.$$` The **deviation** is defined as the difference between a given observation and the mean, i.e., `\(d_i = Y_i - \bar{Y}\)`. The **mean absolute deviation (MAD)** is the average of the deviations about the mean, irrespective of their sign: $$ \text{MAD} = \frac{\sum_{i=1}^{n}|d_i|}{n}. $$ The **variance** is the average of the squared deviations around the mean: $$ S^2 = \frac{\sum_{i=1}^{n}d_i^2}{n-1}. $$ --- # The GameStop Short Squeeze .center[
] --- ## Summarizing the GME Short Squeeze: Avg/Var Measures .pull-left-2[ .font80[ ```r gme_get = tidyquant::tq_get(x = 'GME', from = '2020-01-01') |> dplyr::select(date, adjusted) |> dplyr::mutate( year = lubridate::year(date), month = lubridate::month(date, label = T) ) gme_get ``` ] ] .pull-right-2[ ``` ## # A tibble: 779 × 4 ## date adjusted year month ## <date> <dbl> <dbl> <ord> ## 1 2020-01-02 1.58 2020 Jan ## 2 2020-01-03 1.47 2020 Jan ## 3 2020-01-06 1.46 2020 Jan ## 4 2020-01-07 1.38 2020 Jan ## 5 2020-01-08 1.43 2020 Jan ## 6 2020-01-09 1.39 2020 Jan ## 7 2020-01-10 1.36 2020 Jan ## 8 2020-01-13 1.36 2020 Jan ## 9 2020-01-14 1.18 2020 Jan ## 10 2020-01-15 1.15 2020 Jan ## # … with 769 more rows ``` ] --- count: false ## Summarizing the GME Short Squeeze: Avg/Var Measures .pull-left-2[ .font80[ ```r gme_get = tidyquant::tq_get(x = 'GME', from = '2020-01-01') |> dplyr::select(date, symbol, adjusted) |> dplyr::mutate( year = lubridate::year(date), month = lubridate::month(date, label = T) ) *gme_summary = * gme_get |> * dplyr::group_by(symbol) gme_summary ``` ] ] .pull-right-2[ ``` ## # A tibble: 779 × 5 ## # Groups: symbol [1] ## date symbol adjusted year month ## <date> <chr> <dbl> <dbl> <ord> ## 1 2020-01-02 GME 1.58 2020 Jan ## 2 2020-01-03 GME 1.47 2020 Jan ## 3 2020-01-06 GME 1.46 2020 Jan ## 4 2020-01-07 GME 1.38 2020 Jan ## 5 2020-01-08 GME 1.43 2020 Jan ## 6 2020-01-09 GME 1.39 2020 Jan ## 7 2020-01-10 GME 1.36 2020 Jan ## 8 2020-01-13 GME 1.36 2020 Jan ## 9 2020-01-14 GME 1.18 2020 Jan ## 10 2020-01-15 GME 1.15 2020 Jan ## # … with 769 more rows ``` ] --- count: false ## Summarizing the GME Short Squeeze: Avg/Var Measures .pull-left-2[ .font80[ ```r gme_get = tidyquant::tq_get(x = 'GME', from = '2020-01-01') |> dplyr::select(date, symbol, adjusted) |> dplyr::mutate( year = lubridate::year(date), month = lubridate::month(date, label = T) ) gme_summary = gme_get |> dplyr::group_by(symbol) |> * dplyr::summarise( * ajusted_avg = mean(adjusted), * 
adjusted_med = median(adjusted), * adjusted_var = var(adjusted), * adjusted_sd = sd(adjusted) * ) gme_summary |> t() # transposing for printout ``` ] ] .pull-right-2[ ``` ## [,1] ## symbol "GME" ## ajusted_avg "24.44359" ## adjusted_med "26.05" ## adjusted_var "358.923" ## adjusted_sd "18.94526" ``` ] --- count: false ## Summarizing the GME Short Squeeze: Avg/Var Measures .pull-left-2[ .font80[ ```r gme_get = tidyquant::tq_get(x = 'GME', from = '2020-01-01') |> dplyr::select(date, symbol, adjusted) |> dplyr::mutate( year = lubridate::year(date), month = lubridate::month(date, label = T) ) gme_summary = gme_get |> * dplyr::group_by(symbol, year) |> dplyr::summarise( ajusted_avg = mean(adjusted), adjusted_med = median(adjusted), adjusted_var = var(adjusted), adjusted_sd = sd(adjusted) ) gme_summary ``` ] ] .pull-right-2[ ``` ## # A tibble: 4 × 6 ## # Groups: symbol [1] ## symbol year ajusted_avg adjusted_med adjusted_var adjusted_sd ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 GME 2020 1.79 1.19 1.17 1.08 ## 2 GME 2021 42.4 44.5 207. 
14.4 ## 3 GME 2022 29.6 28.9 32.9 5.74 ## 4 GME 2023 19.9 20.5 4.54 2.13 ``` ] --- count: false ## Summarizing the GME Short Squeeze: Avg/Var Measures .pull-left-2[ .font80[ ```r gme_get = tidyquant::tq_get(x = 'GME', from = '2020-01-01') |> dplyr::select(date, symbol, adjusted) |> dplyr::mutate( year = lubridate::year(date), month = lubridate::month(date, label = T) ) gme_summary = gme_get |> * dplyr::group_by(symbol, year, month) |> dplyr::summarise( ajusted_avg = mean(adjusted), adjusted_med = median(adjusted), adjusted_var = var(adjusted), adjusted_sd = sd(adjusted) ) print(gme_summary, n=15) ``` ] ] .pull-right-2[ ``` ## # A tibble: 38 × 7 ## # Groups: symbol, year [4] ## symbol year month ajusted_avg adjusted_med adjusted_var adjusted_sd ## <chr> <dbl> <ord> <dbl> <dbl> <dbl> <dbl> ## 1 GME 2020 Jan 1.22 1.16 0.0321 0.179 ## 2 GME 2020 Feb 0.981 1.00 0.00410 0.0640 ## 3 GME 2020 Mar 1.00 0.992 0.00522 0.0722 ## 4 GME 2020 Apr 1.15 1.20 0.0742 0.272 ## 5 GME 2020 May 1.16 1.12 0.0164 0.128 ## 6 GME 2020 Jun 1.15 1.14 0.00545 0.0738 ## 7 GME 2020 Jul 1.03 1.03 0.00128 0.0358 ## 8 GME 2020 Aug 1.20 1.16 0.0188 0.137 ## 9 GME 2020 Sep 2.13 2.17 0.124 0.352 ## 10 GME 2020 Oct 3.04 3.03 0.221 0.470 ## 11 GME 2020 Nov 3.11 2.92 0.181 0.425 ## 12 GME 2020 Dec 4.15 4.06 0.410 0.640 ## 13 GME 2021 Jan 19.9 9.78 648. 25.5 ## 14 GME 2021 Feb 17.9 13.1 117. 10.8 ## 15 GME 2021 Mar 47.1 48.6 133. 11.5 ## # … with 23 more rows ``` ] --- class: inverse, center, middle # Correlation --- # The Pearson Correlation Coefficient - **Correlation:** measures the strength of the **linear relationship** between two quantitative variables. - It can be computed using the `cor()` from base R. 
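As a minimal sketch of the `cor()` call (the two toy vectors below are made up purely for illustration, not course data):

```r
# Hypothetical toy data, only to illustrate the cor() call
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 8.1, 9.8) # roughly linear in x

cor(x, y) # Pearson correlation (the default method)
```

Since `y` is nearly linear in `x`, the computed coefficient is close to 1.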
Mathematically speaking, the Pearson correlation coefficient, `\(r\)`, can be computed as `$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$` - Do **not** use the Pearson correlation coefficient unless both variables are quantitative. Instead, refer to the `mixed.cor()` function from the [psych package](https://personality-project.org/r/psych/help/mixed.cor.html) to compute the correlations for mixtures of continuous, polytomous, and/or dichotomous variables. - You should supplement **any descriptive summaries with visualizations** to ensure that you are able to interpret the computations correctly. --- ## Supplement Summaries with Viz: Anscombe's Dataset **In a seminal paper, Anscombe stated:** > **Few of us escape being indoctrinated with these notions:** > - numerical **calculations are exact, but graphs are rough**; > - for any particular kind of **statistical data there is just one set of calculations constituting a correct statistical analysis**; > - performing **intricate calculations is virtuous**, whereas **actually looking at the data is cheating**. He proceeded by stating that > a computer should **make both calculations and graphs**. Both sorts of output should be studied; each will contribute to understanding. Now, let us consider his four datasets, each consisting of eleven (x,y) pairs. .footnote[ <html> <hr> </html> **Source:** Anscombe, Francis J. 1973. "Graphs in Statistical Analysis." *The American Statistician* 27 (1): 17–21. ([PDF Link](https://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf)). ] --- count: false ## Supplement Summaries with Viz: Anscombe's Dataset .font80[
] --- count: false ## Supplement Summaries with Viz: Anscombe's Dataset .font80[
] --- count: false ## Supplement Summaries with Viz: Anscombe's Dataset <img src="data:image/png;base64,#05_ts_summary_files/figure-html/anscombe4-1.png" width="70%" style="display: block; margin: auto;" /> --- # Kahoot Competition #02 To assess your understanding and retention of the topics covered so far, you will **compete in a Kahoot competition (consisting of 6 questions)**: - Go to <https://kahoot.it/> - Enter the game pin, which will be shown during class - Provide your first (preferred) and last name - Answer each question within the allocated time window (**fast and correct answers provide more points**) **Winning the competition involves having as many correct answers as possible AND taking the shortest duration to answer these questions.** The winner
of the competition will receive a $10 Starbucks gift card. Good luck!!! .footnote[ <html> <hr> </html> **P.S.:** The Kahoot competition will have **no impact on your grade**. It is a **fun** way of assessing your knowledge, motivating you to ask questions about topics covered that you do not yet fully understand, and providing me with some data to pace class. ] --- class: inverse, center, middle # Transformations --- # Guidelines for Transforming Time Series Data <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#../../figures/transformations.png" alt="A classification of common transformation approaches for time series data" width="100%" /> <p class="caption">A classification of common transformation approaches for time series data</p> </div> .footnote[ <html> <hr> </html> My (incomplete) attempt to provide you with a taxonomy for time series data transformations. ] --- # Stabilize the Mean: Differencing The plot below shows the number of murdered women per 100,000 people in the U.S. From the plot, we can see that the time series is not stationary. <img src="data:image/png;base64,#05_ts_summary_files/figure-html/murdered_women-1.png" width="70%" style="display: block; margin: auto;" /> --- count:false # Stabilize the Mean: Differencing The plot below shows the **first nonseasonal difference**. From the plot, we can see that differencing has reduced the nonstationary nature of the time series. <img src="data:image/png;base64,#05_ts_summary_files/figure-html/murdered_women2-1.png" width="70%" style="display: block; margin: auto;" /> --- # Computing the First Nonseasonal Difference The change in the time series from one period to the next is known as the first nonseasonal difference.
It can be computed as follows: `$$DY_t = Y_t - Y_{t-1}$$` .pull-left-2[ .font80[ ```r women_murdered_filtered = women_murdered |> # dataset read in previous lines of code dplyr::filter(year > 2000) print(women_murdered_filtered) ``` ] ] .pull-right-2[ ``` ## # A tibble: 6 × 3 ## country year murders_per_100000 ## <chr> <dbl> <dbl> ## 1 United States 2001 2.77 ## 2 United States 2002 2.64 ## 3 United States 2003 2.57 ## 4 United States 2004 2.53 ## 5 United States 2005 2.6 ## 6 United States 2006 2.58 ``` ] --- count:false # Computing the First Nonseasonal Difference The change in the time series from one period to the next is known as the first nonseasonal difference. It can be computed as follows: `$$DY_t = Y_t - Y_{t-1}$$` .pull-left-2[ .font80[ ```r women_murdered_filtered = women_murdered |> # dataset read in previous lines of code dplyr::filter(year > 2000) *women_murdered_filtered = * women_murdered_filtered |> * dplyr::mutate( * diff1 = murders_per_100000 - dplyr::lag(murders_per_100000, n = 1), * diff2 = c(NA, diff(murders_per_100000, lag = 1)) * ) print(women_murdered_filtered) ``` ] ] .pull-right-2[ ``` ## # A tibble: 6 × 5 ## country year murders_per_100000 diff1 diff2 ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 United States 2001 2.77 NA NA ## 2 United States 2002 2.64 -0.130 -0.130 ## 3 United States 2003 2.57 -0.0700 -0.0700 ## 4 United States 2004 2.53 -0.0400 -0.0400 ## 5 United States 2005 2.6 0.0700 0.0700 ## 6 United States 2006 2.58 -0.0200 -0.0200 ``` ] --- # Computing the First Seasonal Difference If your data exhibits a seasonal pattern, as illustrated in Slides 6 and 18 in [03_ts_viz.html](https://fmegahed.github.io/isa444/spring2023/class03/03_ts_viz.html), you should employ a **seasonal differencing approach**: subtract from each observation the observation from the same season in the previous cycle. Let `\(m\)` denote the number of seasons, e.g. `\(m=4\)` for quarterly data.
In such a case, the seasonal difference is computed as follows: `$$D_mY_t = Y_t - Y_{t-m}$$` **Note:** In R, this can be computed by setting the `n` argument in `dplyr::lag()` to `\(m\)`, or by setting the `lag` argument in `diff()` to `\(m\)`. --- # Stabilize the Variance: Power Transformations .left-code[ .font80[ ```r # The built-in JohnsonJohnson dataset forecast::autoplot(JohnsonJohnson) + ggplot2::geom_point() + # adding points ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n=6)) + ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) + ggplot2::theme_bw() ``` ] ] .right-plot[ <img src="data:image/png;base64,#05_ts_summary_files/figure-html/jj1_out-1.png" width="70%" style="display: block; margin: auto;" /> ] --- count: false # Stabilize the Variance: Power Transformations .left-code[ .font80[ ```r # The built-in JohnsonJohnson dataset *forecast::autoplot(sqrt(JohnsonJohnson)) + ggplot2::geom_point() + # adding points ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n=6)) + ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) + ggplot2::theme_bw() ``` ] ] .right-plot[ <img src="data:image/png;base64,#05_ts_summary_files/figure-html/jj2_out-1.png" width="70%" style="display: block; margin: auto;" /> ] --- count: false # Stabilize the Variance: Power Transformations .left-code[ .font80[ ```r # The built-in JohnsonJohnson dataset *forecast::autoplot(log(JohnsonJohnson)) + ggplot2::geom_point() + # adding points ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n=6)) + ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) + ggplot2::theme_bw() ``` ] ] .right-plot[ <img src="data:image/png;base64,#05_ts_summary_files/figure-html/jj3_out-1.png" width="70%" style="display: block; margin: auto;" /> ] --- # A Note on the Log Transform The log transformation can be computed as follows: `$$L_t = \ln{(Y_t)}$$` Note that the `log()` function in R uses the natural logarithm as its default base, i.e., it transforms a variable/statistic according to the above equation.
The reverse transformation using the exponential function is: `$$e^{L_t} = e^{\ln{(Y_t)}} = Y_t$$` --- count: false # The Log Transform - The primary purpose of the log transform is to **convert exponential growth into linear growth.** - The transform often has the **secondary purpose of balancing the variance.** - Differences in logs and growth rate transformations produce similar results and interpretations (see next slides). --- # Stabilizing the Mean and Variance The **first nonseasonal difference in logarithms** represents the logarithm of the ratio `$$DL_t = \ln{(\frac{Y_t}{Y_{t-1}})} = \ln{(Y_t)} - \ln{(Y_{t-1})}$$` In the absence of seasonality, the **growth rate** for a time series is given by `$$GY_t = 100 \frac{Y_t - Y_{t-1}}{Y_{t-1}}$$` --- count: false # Stabilizing the Mean and Variance .left-code[ .font80[ ```r # The built-in JohnsonJohnson dataset forecast::autoplot( * log(JohnsonJohnson) - log(stats::lag(JohnsonJohnson)) ) + ggplot2::geom_point() + # adding points ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n=6)) + ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) + ggplot2::theme_bw() ``` ] ] .right-plot[ <img src="data:image/png;base64,#05_ts_summary_files/figure-html/jj4_out-1.png" width="70%" style="display: block; margin: auto;" /> ] --- count: false # Stabilizing the Mean and Variance .left-code[ .font80[ ```r # The built-in JohnsonJohnson dataset forecast::autoplot( * (JohnsonJohnson - stats::lag(JohnsonJohnson))/ stats::lag(JohnsonJohnson) ) + ggplot2::geom_point() + # adding points ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n=6)) + ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) + ggplot2::theme_bw() ``` ] ] .right-plot[ <img src="data:image/png;base64,#05_ts_summary_files/figure-html/jj5_out-1.png" width="70%" style="display: block; margin: auto;" /> ] --- # A Practical Note about Growth Rates
.panelset[ .panel[.panel-name[Activity] > Over the next 5 minutes, please answer the question in each tab. ] .panel[.panel-name[Q1] - **Question 1:** Let us say that an investor purchased 10 shares of \$GME, on 2021-01-29, at \$325/share. The next trading day, 2021-02-01, the GME stock closed at $225. Compute the growth rate in their portfolio worth (assuming it only holds the GME stock) over this time period. .can-edit.key-activity4_q1[ **What is their growth rate?** .font70[(Insert below)] - Edit me ] ] .panel[.panel-name[Q2] - **Question 2:** Let us say that the growth rate from Question 1 was `\(GY_t = -g\)`. Now let us assume that the GME stock went up by `\(g\)` (i.e., if it went down 10%, it increased by 10% over the next trading day). What is the value of the investor's portfolio by stock market closing on 2021-02-02? .can-edit.key-activity4_q2[ **What is their portfolio value?** .font70[(Insert below)] - Edit me ] ] ] --- # A Live Demo In this live coding session, we will capitalize on the `mutate()` function from [tidyverse](https://www.tidyverse.org/) to create transformations for multiple time series. Specifically, we will use the `tq_get()` function from [tidyquant](https://business-science.github.io/tidyquant/) to extract data about the following cryptocurrencies: (a) [Cardano](https://cardano.org/) (ADA), (b) [Chainlink](https://chain.link/) (LINK), and (c) [Zilliqa](https://www.zilliqa.com/) (ZIL). We will compute: - Growth Rates - Natural log - Log Differences - `\([0-1]\)` Scaling Obviously, we will have to ensure that these transformations are computed for each coin separately. For the purpose of this activity, let us extract the data from 2023-01-01 to 2023-02-05. --- class: inverse, center, middle # Recap --- # Summary of Main Points By now, you should be able to do the following: - Use numerical summaries to describe a time series. - Explain what we mean by correlation. - Apply transformations to a time series.
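As a compact recap of today's transformations, here is a minimal sketch on a simulated toy series (the values below are made up, not the crypto or GME data):

```r
library(dplyr)

# Simulated toy series (hypothetical values, for illustration only)
toy = tibble::tibble(
  t = 1:6,
  y = c(100, 105, 103, 110, 120, 118)
)

toy_transformed = toy |>
  mutate(
    diff1       = y - lag(y),                       # first nonseasonal difference
    growth_rate = 100 * (y - lag(y)) / lag(y),      # growth rate (%)
    log_y       = log(y),                           # natural log
    log_diff    = log(y) - lag(log(y)),             # first difference in logs
    scaled_01   = (y - min(y)) / (max(y) - min(y))  # [0-1] scaling
  )

toy_transformed
```

For multiple series (e.g., several coins), adding a `dplyr::group_by(symbol)` before the `mutate()` ensures each series is transformed separately.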
--- # Things to Do to Prepare for Our Next Class - Go over your notes and read through [Chapters 2.1-2.5 of our reference book](https://cdn.shopify.com/s/files/1/0859/4364/files/Part_I_POBF-_A_First_Course_in_Forecasting_1.pdf?612). - Complete [Assignment 04](https://miamioh.instructure.com/courses/188655/assignments/2377018) on Canvas.